THE THEORY OF INFORMATION

I.  Introductory  Remarks 

1. Scope and Continuity of this Report

This report discusses the modern theory of 2-point unidirectional
communication that is associated with the names of Shannon and Wiener, in
the light of Shannon's theory of information. While being for the most
part an outline of Shannon's classical paper (22), the report also sketches
some applications and presents a discussion on the question of uniqueness
of formulation of the theory of information.

In an attempt not to obscure the underlying train of thought,
some of the mathematical proofs are heuristic in nature. The theory's
present state makes this inevitable anyway.

The block diagram below summarizes the continuity of the paper.


Fig.  1  Continuity  Diagram 


2.  General  Remarks  on  the  Theory  of  Information 

During the last decade or so it has been realized that communication
in the presence of noise is a problem susceptible to treatment by
the methods of probability theory. In such treatments we have all been
accustomed to the frequent use of such scalars as the second moments of
distributions, etc. Shannon has shown the great usefulness of defining
another scalar, called the information rate, and has built up a theory of
communication in which information rate plays a fundamental part. The
crux of the theory is that information rate is a scalar capable of
characterizing a source in such a manner as to specify the speed at which source
messages are to be transmitted in order that they may be received without
error in spite of the presence of a given intervening noise. (Actually it
is usually possible only to transmit with an arbitrarily small, but
non-zero, probability of error, but this is a fine point of the type that will
henceforth be overlooked.)

In Shannon's paper information rate is introduced by first
defining a quantity called information, which is shown to warrant that name
because it satisfies many of the intuitive requirements for such a quantity.
For the sake of variety, a different approach is used in this report. We
postulate certain requirements for a scalar which is to be called
information rate, and show, by assuming certain restrictions, that the postulates
imply a unique formulation. This "uniqueness theorem" gives some insight
into the fact that information rate seems to be of such fundamental
importance not only for the problem of two-point communication, but for
broader fields as well.


(1)  Cf.  references  4,  f,  19 
As regards applicability of the theory to the design of specific
communication links, as well as the appraisal of existing links, such attempts
usually turn out to be discouragingly difficult. In this respect
information theory can be compared to electromagnetic theory, where the analytic
work involved in solving specific problems is often forbidding. In
information theory the fundamental "undefined" variables are "emitted symbol"
and "received symbol". The existence of noise in the transmitting channel
is taken care of in the theory by not requiring that the received symbols
be the same as the symbols emitted by the source, but only that there be
a statistical dependence between the two. The fundamental problem is to
"code" the emitted symbols in such a way as to best combat the noise.
In order to take into account the fact that the recipient may not be
interested in all the detail of the emitted symbols the concept of
"fidelity" is introduced. It is evident that a vast number of problems
arising in technology can be described in terms of information theory by
posing the problem of how best to modify (i.e. "code" or "modulate") the
output of some source (i.e. "emitted symbol") so as to best suit the destined
recipient, where there is a chance that the article transmitted will be
distorted along the way. The following may briefly be cited as examples.

(a) A source emits real numbers between 0 and 1 at the rate of ten
per second, the distribution of the numbers being known. The recipient is
interested in knowing the output correct to three decimal places. The
transmitting facilities are capable of transmitting only 0's and 1's at
the rate of 25 per second, and are disturbed by noise in such a way that if
a certain symbol (i.e. a 0 or 1) is transmitted there is a probability of
74 that the wrong symbol is received. The question arises: Is it
possible to satisfy the recipient's desire of accuracy, and if
so, how should the source output be coded? Note that although it will
be necessary to represent the real number in terms of a sequence of
0's and 1's, this does not necessarily mean that the representation
should be of the binary type (i.e. base 2 representation). The fact
that the noise corrupts 0's and 1's indiscriminately and independently
would make it likely that a binary representation, where some digits
carry more weight than others, is not as good as a more hybrid type of
representation.

(b) Speech is to be transmitted over a channel having
a bandwidth of 10 cps. The transmitter is capable of delivering an
(average) power of 1 kw. The channel (including input circuit of
receiver) is permeated by white noise of 2 watts intensity. What type of
modulation system should be used if the only criterion of fidelity is
that the transmitted speech is received in intelligible form?

(c) A photo-electric device equipped with telescope is
to be capable of indicating on a 3-position dial whether a cloud at
which the telescope is pointed is predominantly of the cirrus, stratus,
or cumulus type. How should such a device be made? To fit this
situation into the mathematical model of information theory it is
necessary to make the following interpretations:

sky --> source

device --> coder

space between dial and eye of observer (possibly also nervous
system of observer, etc.) --> channel

fact that channel can transmit only the words "cirrus",
"stratus", and "cumulus", and that these words are transmitted
without error when so indicated on the dial --> channel
"noise" characteristic

To  give  an  indication  of  the  potentialities  of  information 
theory  we  will  now  outline  what  information  theory  "tells  us  to  do"  in 
each  of  the  three  above  cases.  The  appendix  will  show  more  specifically 
how  the  statements  below  follow  from  the  theory  developed  in  the  body 
of  the  report. 

(a) Information theory gives a mathematical scheme
for obtaining the optimum representation of the real numbers in the
system using only 0's and 1's. This scheme requires minimizing functions
of several variables, solving equations, etc., and could be achieved by a
great deal of horse work. The resulting optimum system will require an
"infinite" delay at the transmitter, and thus would have to involve a
storage tube or equivalent device. It is likely that if a common-sense
coding scheme were used instead, the resulting system, although not
strictly optimum, would have a much greater chance of being
physically realizable in practice.

(b) Information theory tells us to build a detector
capable of recognizing speech sounds, and a coder to code the detector
output into samples of white noise. Information theory does not give any
technically valuable hints as to how to build the speech sound detector.

(c) Information theory tells us to build the indicating
device but gives no worthwhile indication of how to go about it.
From the above we see that information theory is
in most cases unfortunately merely a device for rephrasing already
well-realized technical difficulties into more generalized form. The main
selling point of information theory is that in reducing difficulties to
a more generalized form it may be of conceptual help in their solution.
II. Concepts of Probability Theory


1.  Summary 

Probability  distributions  necessary  for  a  statistical 
description  of  2-point  communication  in  the  presence  of  noise  are 
defined. 

2. Statistical Description of Unidirectional 2-Point Communication

We are concerned with the description of a link made up of
a source, producing symbols, x, which are corrupted by noise into
received symbols, y.




Fig.  2  Fundamental  Communication  Link 

It is convenient to define "symbol" in terms of the actual output of
the source in such a way that successive such symbols are independent
and are affected independently by noise. For instance, if the source
is one that produces letters of written English in the presence of a
noise that affects successive letters independently, a symbol should be
defined as a group of ten or more consecutive letters, because
successive such groups are practically independent in written English (2).


(2)  See  reference  22.  To  be  on  the  safe  side  it  might  be  necessary  to 
use  considerably  longer  groups  to  eliminate  context. 
With the foregoing in mind, a message can be defined as a sequence
of independent symbols. Messages can be described in terms of the
probabilities of their symbols, the probability of a certain symbol being
the fraction of time it occurs in a long message. The usefulness of
the probabilistic approach is that for many statistical sources occurring
in nature the probabilities associated with long messages from a given
source are the same for all long messages from that source.

In the presence of noise the emission of a certain symbol, x, by
the source may result in the situation that the corresponding received
symbol, y, is not the same as x. Physically, the noise occurs in the
transmission link, or "channel". The channel is described statistically
by associating a family of transition probabilities with the noise. We
define

$q_x(y)\,dy$ = probability that the received symbol will be in the
region (y, y+dy) of the symbol space if the emitted symbol is x.

Let us also define

$p(x)\,dx$ = probability that the emitted symbol is in (x, x+dx).

These two distributions determine a joint probability:

$p(x,y)\,dx\,dy = p(x)\,dx\; q_x(y)\,dy$, the probability that a symbol in the
range (x, x+dx) will be emitted and (as a result of this) a symbol in
the range (y, y+dy) received.

Focusing our attention on the received symbols without reference to
their prime cause, we see a statistical situation described by

$q(y)\,dy$ = probability that a symbol in the range (y, y+dy) is received.

It is also possible to define the inverse transfer probability

$p_y(x)\,dx$ = probability that the emitted symbol was in (x, x+dx) if
the received symbol was y.

All the distributions can be expressed in terms of p(x,y):

$p(x) = \int p(x,y)\,dy$

$q(y) = \int p(x,y)\,dx$

$q_x(y) = p(x,y)/p(x)$

$p_y(x) = p(x,y)/q(y)$.

It should be noted that before the receipt of a symbol the recipient's
knowledge of what will be emitted is characterized by the distribution
p(x), while after the receipt of, say, the symbol y = m the relevant
distribution as to what was sent is $p_m(x)$. It is therefore natural to
think of p(x) as the a-priori distribution, and $p_m(x)$ as the distribution
of what was emitted a-posteriori to receipt of m. To emphasize this we
will often write $p(x) = p_0(x)$.

For cases in which the variables assume only a discrete set of
values the distributions can be obtained by use of the Dirac delta
function. We have

$p(x,y) = \sum_{i,j} P(i,j)\,\delta(x-i)\,\delta(y-j)$

$p(x) = \sum_i P(i)\,\delta(x-i)$

$q(y) = \sum_j Q(j)\,\delta(y-j)$

$p_y(x)\big|_{y=j} = \sum_i P_j(i)\,\delta(x-i)$

where $P(i,j)$, $P(i)$, $Q(j)$, $Q_i(j)$, $P_j(i)$ are the analogous probabilities for
the discrete symbols.
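As an illustration of how the discrete distributions follow from the joint table, here is a minimal Python sketch. The table values and all function names are hypothetical and chosen only for the example; they do not come from the report.

```python
# Hypothetical 2x3 joint table P(i, j): rows are emitted symbols i,
# columns are received symbols j. The numbers are illustrative only.
P_joint = [
    [0.30, 0.05, 0.05],
    [0.05, 0.10, 0.45],
]

def marginals(joint):
    """Return (P(i), Q(j)) by summing the joint table over j and over i."""
    p_emit = [sum(row) for row in joint]
    q_recv = [sum(col) for col in zip(*joint)]
    return p_emit, q_recv

def transition(joint):
    """Q_i(j) = P(i, j) / P(i): the noise description of the channel."""
    p_emit, _ = marginals(joint)
    return [[pij / p_emit[i] for pij in row] for i, row in enumerate(joint)]

def posterior(joint):
    """P_j(i) = P(i, j) / Q(j): what was emitted, given what was received."""
    _, q_recv = marginals(joint)
    return [[row[j] / q_recv[j] for row in joint] for j in range(len(q_recv))]

P_emit, Q_recv = marginals(P_joint)
Q_trans = transition(P_joint)   # Q_trans[i][j] = Q_i(j)
P_post = posterior(P_joint)     # P_post[j][i] = P_j(i)
```

Each row of `Q_trans` and of `P_post` sums to one, which is the discrete counterpart of the normalization of $q_x(y)$ and $p_y(x)$.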
III. Definitions of Information Rates

1.  Summary 

A  definition  of  information-receipt  rate  is  evolved  from 
fundamental  postulates.  Derived  definitions  are  then  formulated  for 
the  concepts  of  information  rate  of  a  source  and  information- 
transmitting  rate  of  a  channel. 

2.  Information-Receipt  Rate 

Let I(m) denote the information obtained as the result of
receiving the particular message y = m. The following postulate
suggests itself:

Postulate I: I(m) is a scalar which depends on the a-priori and
a-posteriori distributions of what the source emitted:

$I(m) = \Phi[p_0(x), p_m(x)]$

with the property that

$\Phi[p_0(x), p_0(x)] = 0$.

The  second  part  of  the  postulate  implies  that  no  information  was 
gained  if  the  a-posteriori  probability  as  to  what  was  transmitted  is 
the  same  as  the  a-priori  one  as  to  what  will  be  transmitted. 

As the entire process under consideration is a statistical one
it is to be expected that statistical functions of I will play a
more important part than I itself. We define the "information-receipt"
rate R as the average amount of information received per
symbol, i.e. as the expected value of I:

$R = E[I(m)] = \int q(m)\,I(m)\,dm$.

R should be invariant under any transformation that merely
amounts to a one-to-one relabeling of the message symbols without
changing the fundamental physical process; otherwise the information
obtainable from a message could be changed by restating the message
in a logically equivalent way. For instance, suppose the received
message is read from a meter calibrated according to $y^3$ instead of y.
If the distributions $p_0$ and $p_m$ are recalculated on the basis of $x^3$
instead of x the resulting value of R should be the same. Now a
relabeling of the variable $x \to f(x)$ transforms a distribution p(x) into
the distribution p(x)/f'(x), where x = g(z) is the function inverse to
z = f(x). Therefore we have the following postulate:

Postulate II: The transformation

$p(x) \to p(g(x))/f'(g(x))$, where g is the inverse of f, and p
generically represents all the probability distributions entering into
the definition of R, leaves R invariant.

Actually we will not consider the problem of finding the most
general functional $\Phi$ that satisfies the postulates, because this
problem is too difficult, and has not yet been solved to the author's
knowledge.

Assumption I: I is of the form

$I = \int F(p_0(x), x)\,dx - \int F(p_m(x), x)\,dx$

where F = F(u,v) is some real function of two real variables.

We can think of $\int F(p(x), x)\,dx$ as the "uncertainty" associated
with the distribution p(x). Then the restricted class of definitions
of I determined by assumption I is one in which the received information
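(2)  $\int F\!\left[\frac{p_0(g(z))}{f'(g(z))},\, z\right] dz - \iint q(m)\, F\!\left[\frac{p_m(g(z))}{f'(g(z))},\, z\right] dz\, dm =$

$= \int F\!\left[\frac{p_0(x)}{f'(x)},\, f(x)\right] f'(x)\,dx - \iint q(m)\, F\!\left[\frac{p_m(x)}{f'(x)},\, f(x)\right] f'(x)\,dx\,dm$

is independent of the choice of f.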

(3) Let F(u,v) = uG(u,v). Then (2) becomes

(4)  $\int p_0\, G(p_0/f', f)\,dx - \iint q\, p_m\, G(p_m/f', f)\,dx\,dm$.


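(5) Subject f to the variation $\delta f(x) = \epsilon w(x)$. Since (4) is independent
of f the corresponding variation of (4) must vanish:

(6.1)  $-\int p_0^2\, G_u(p_0/f', f)\, \frac{w'}{f'^2}\,dx + \iint q\, p_m^2\, G_u(p_m/f', f)\, \frac{w'}{f'^2}\,dx\,dm \;+$

(6.2)  $+ \int p_0\, G_v(p_0/f', f)\, w\,dx - \iint q\, p_m\, G_v(p_m/f', f)\, w\,dx\,dm = 0$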

Since w' can be very large compared to w, lines (6.1) and (6.2) must vanish
separately. The vanishing of (6.1) implies in turn that

(7)  $p_0^2\, G_u(p_0/f', f) - \int q\, p_m^2\, G_u(p_m/f', f)\,dm = 0$.

(8) Setting $G_u(u,v) = r(u,v)/u^2$ we obtain

(9)  $r(p_0/f', f) - \int q\, r(p_m/f', f)\,dm = 0$.

Note that a variation $\Delta p(x,y) = \epsilon\, h(x)\,k(y)$, where $\int h\,dx = \int k\,dy = 0$, is an
admissible variation of p(x,y) providing h and k are appropriately
bounded. Such a variation produces the following variations in the
associated distributions:

(10)  $\Delta p_0(x) = 0$

      $\Delta q(m) = 0$

      $\Delta p_m(x) = \epsilon\, h(x)\,k(m)/q(m)$.

Subjecting the respective quantities of (9) to the variations
prescribed by (10) yields

(11)  $\int r_u(p_m/f', f)\,k(m)\,dm = 0$.

Due to the arbitrariness of k(m), (11) implies that

(12)  $r_u(p_m/f', f)$ = function independent of m.

Subjecting (12) to another variation of the type (10) we obtain

(12) + $\Delta$(12) = independent of m; therefore

(13)  $f'\,\Delta(12) = \epsilon\, r_{uu}(p_m/f', f)\, h(x)\,k(m)/q(m)$ = indep. of m.

From the arbitrariness of k(m) it clearly follows that

(14)  $r_{uu}(u,v) = 0$.

Combining (14) and (8) gives

(15)  $G(u,v) = a(v)\ln u + b(v)/u + c(v)$

for some functions a(v), b(v), c(v).

Returning now to (6.2) we see that it implies

(16)  $p_0\, G_v(p_0/f', f) - \int q\, p_m\, G_v(p_m/f', f)\,dm = 0$.

(17) Let $G_v(u,v) = s(u,v)/u$. Then (16) becomes

(18)  $s(p_0/f', f) - \int q\, s(p_m/f', f)\,dm = 0$. As (18) is of exactly the same
form as (9) it similarly implies that

(19)  $s_{uu}(u,v) = 0$.

Combining (19) and (17) yields

(20)  $G_v(u,v) = A(v) + B(v)/u$.

But (15) implies

(21)  $G_v(u,v) = a'(v)\ln u + b'(v)/u + c'(v)$.

Combining (20) and (21):

(22)  $a'(v) = 0$.

Thus

(23)  $G(u,v) = \text{const}\cdot\ln u + b(v)/u + c(v)$ and
(24)  $F(u,v) = \text{const}\cdot u\ln u + u\,c(v) + b(v)$.

Substituting (24) in (1):

(25)  $R/\text{const} = \int p_0 \ln p_0\,dx - \iint q\, p_m \ln p_m\,dx\,dm + \int \left(p_0 - \int q\, p_m\,dm\right) c(x)\,dx$.

But $\int q(m)\,p_m(x)\,dm = \int q(m)\,p(x,m)/q(m)\,dm = p_0(x)$.

Therefore the last integral of (25) vanishes, making it

(26)  $R/\text{const} = \iint p(x,y)\ln p_0(x)\,dx\,dy - \iint p(x,m)\ln\frac{p(x,m)}{q(m)}\,dx\,dm =$

$= -\iint p(x,y)\ln\frac{p(x,y)}{p(x)\,q(y)}\,dx\,dy$,

as was to be proved.

Arbitrarily setting the const of (26) equal to -1 (4), we obtain four
equivalent representations of R:

$R = \iint p(x,y)\ln\frac{p(x,y)}{p(x)\,q(y)}\,dx\,dy =$

$= -\int p(x)\ln p(x)\,dx - \int q(y)\ln q(y)\,dy + \iint p(x,y)\ln p(x,y)\,dx\,dy =$

$= -\int p(x)\ln p(x)\,dx + \iint p(x,y)\ln p_y(x)\,dx\,dy =$

$= -\int q(y)\ln q(y)\,dy + \iint p(x,y)\ln q_x(y)\,dx\,dy$. (5)

(4) Sometimes it is convenient to use the base 2 for the logarithm; in
that case const = -1/ln 2.

(5) More compact representations of these relations will be given in
section IV.

In the discrete case the formulas reduce to

$R = \sum_{i,j} P(i,j)\ln\frac{P(i,j)}{P(i)\,Q(j)} =$

$= -\sum_i P(i)\ln P(i) - \sum_j Q(j)\ln Q(j) + \sum_{i,j} P(i,j)\ln P(i,j) =$

$= -\sum_i P(i)\ln P(i) + \sum_{i,j} P(i,j)\ln P_j(i) =$

$= -\sum_j Q(j)\ln Q(j) + \sum_{i,j} P(i,j)\ln Q_i(j)$.
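The equivalence of the four discrete representations can be checked numerically. The following is a minimal Python sketch under the assumption of a small, made-up joint table; the function name `receipt_rate_forms` and the table values are hypothetical.

```python
from math import log

# Hypothetical joint table P(i, j); rows = emitted i, columns = received j.
P_joint = [[0.30, 0.05, 0.05],
           [0.05, 0.10, 0.45]]

def receipt_rate_forms(joint):
    """Evaluate the four discrete representations of R (natural logarithm)."""
    p = [sum(row) for row in joint]            # P(i)
    q = [sum(col) for col in zip(*joint)]      # Q(j)
    # Only cells with P(i, j) > 0 contribute (0 * ln 0 is taken as 0).
    cells = [(i, j, pij) for i, row in enumerate(joint)
             for j, pij in enumerate(row) if pij > 0]
    h_p = -sum(v * log(v) for v in p if v > 0)
    h_q = -sum(v * log(v) for v in q if v > 0)
    form1 = sum(pij * log(pij / (p[i] * q[j])) for i, j, pij in cells)
    form2 = h_p + h_q + sum(pij * log(pij) for _, _, pij in cells)
    form3 = h_p + sum(pij * log(pij / q[j]) for i, j, pij in cells)  # ln P_j(i)
    form4 = h_q + sum(pij * log(pij / p[i]) for i, j, pij in cells)  # ln Q_i(j)
    return form1, form2, form3, form4

forms = receipt_rate_forms(P_joint)
```

The four entries of `forms` agree to rounding error, and they all vanish when the table factors as P(i)Q(j), i.e. when the received symbol carries no information about the emitted one.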

3.  Information  Rate  of  a  Source 

The  information  rate  of  a  source  is  defined  in  terms  of  the 
per-symbol  rate  at  which  information  produced  by  the  source  is  capable  of 
being  received. 

Consider the expression for information-receipt rate

$R = -\int p(x)\ln p(x)\,dx + \iint p(x,y)\ln p_y(x)\,dx\,dy$

derived in the last paragraph. On first thought one would be inclined to
define the information rate of the source as the value of R that would be
obtained if symbols emitted by the source (described by the distribution
p(x)) were received in the absence of noise, that is with $p_y(x) = \delta(x-y)$.
In general, however, the right side of the above expression for R will
become infinite for this type of transmission (6). In order to make it
unnecessary to set $p_y(x) = \delta(x-y)$, information rate of a source will be
defined relative to a fidelity criterion.

Let $\rho(x,y)$ be a continuous function of x and y whose value is
a measure of the punishment meted out if the symbol y is received as a
result of the source emitting the symbol x. (Presumably $\rho(x,x) = 0$;
that is, there is no punishment if the emitted symbol is also the received
symbol.)


(6) The expression for R becomes infinite if $p_y(x) = \delta(x-y)$, providing
the source is not completely discrete.
The average amount of punishment per transmitted symbol is

$v = \iint p(x,y)\,\rho(x,y)\,dx\,dy$.

Let us call v the quality of the system (7). The information rate of the
source with given p(x) relative to the fidelity criterion $v = \iint p(x,y)\,\rho(x,y)\,dx\,dy$
is defined as the minimum information-receipt rate necessary to preserve
the quality v. The minimum is taken over all possible noise conditions:

$R_{\text{source}} = \min_{q_x(y)} \iint p(x,y)\ln\frac{p(x,y)}{p(x)\,q(y)}\,dx\,dy$  with

$\iint p(x,y)\,\rho(x,y)\,dx\,dy = v = \text{const}$.

For discrete transmission systems the rate of the source is

$R_{\text{source}} = \min_{Q_i(j)} \sum_{i,j} P(i,j)\ln\frac{P(i,j)}{P(i)\,Q(j)}$  with

$\sum_{i,j} P(i,j)\,\rho(i,j) = v = \text{const}$.

In this case it is possible to obtain the rate of the source in an absolute
sense by requiring perfect fidelity; i.e., by requiring $P(i,j) = P(i)\,\delta_{ij}$.
This means $Q_i(j) = \delta_{ij}$ and $Q(j) = P(j)$. Therefore

$R_{\text{source absolute}} = -\sum_i P(i)\ln P(i)$.

(7) "Infidelity" would be a better word.

In order to clarify the remarks of page 16 we can think of the case where
the source symbols have a continuous distribution as a limiting case of the
discrete situation, with the help of the substitution $P(i) = p(x_i)\,\Delta x$. The
fact that $R_{\text{source absolute}}$ becomes infinite as $\Delta x \to 0$ indicates that from the absolute
point of view (i.e. without reference to a fidelity criterion) continuous
sources have an infinite information rate per emitted symbol.
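The divergence can be made concrete with a minimal Python sketch. It assumes, purely for illustration, a uniform density p(x) = 1 on [0, 1) cut into equal cells; the function name is hypothetical.

```python
from math import log

def absolute_rate_uniform(n_cells):
    """-sum P(i) ln P(i) for a uniform density on [0, 1) cut into n_cells cells."""
    dx = 1.0 / n_cells
    probs = [1.0 * dx] * n_cells      # P(i) = p(x_i) * dx with p(x) = 1 (assumed)
    return -sum(P * log(P) for P in probs)

# The absolute rate grows like ln(1/dx) and has no finite limit as dx -> 0.
for n in (10, 100, 1000, 10000):
    print(n, absolute_rate_uniform(n))
```

Under this assumed uniform density, quantizing to three decimal places corresponds to 1000 cells and an absolute rate of ln 1000 nats per symbol; every further decimal place adds ln 10.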

A formal, although not very useful, expression for the rate can be
obtained by carrying out the minimization procedure indicated in the
definition. We will carry it out for the discrete case, and then state the
analogous results for the more general case.

It is desired to minimize

(1)  $D = -\sum_j Q(j)\ln Q(j) + \sum_{i,j} P(i,j)\ln Q_i(j) =$

$= -\sum_{i,j} P(i)\,Q_i(j)\ln\sum_m P(m)\,Q_m(j) + \sum_{i,j} P(i)\,Q_i(j)\ln Q_i(j)$

for given P(i)'s over all $Q_i(j)$, subject to

(2.1)  $E_0 = \sum_{i,j} P(i)\,Q_i(j)\,\rho(i,j) = v = \text{const}$

(2.2)  $E_i = \sum_j Q_i(j) = 1$   (i = 1, 2, ...)

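According to the method of Lagrangian multipliers, the minimum of
D will be obtained when

(3)  $\frac{\partial D}{\partial Q_k(l)} + \sum_{i \ge 0} \lambda_i\,\frac{\partial E_i}{\partial Q_k(l)} = 0$   (k, l = 1, 2, ...)

where the $\lambda_i$ are adjusted to satisfy (2). From (1):

(4)  $\frac{\partial D}{\partial Q_k(l)} = P(k)\log\frac{Q_k(l)}{\sum_m P(m)\,Q_m(l)} = P(k)\log\left(P_l(k)/P(k)\right)$

From (2):

(5)  $\sum_{i \ge 0} \lambda_i\,\frac{\partial E_i}{\partial Q_k(l)} = \lambda_0\,P(k)\,\rho(k,l) + \lambda_k$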
Putting (4), (5) into (3):

(6)  $P_l(k) = P(k)\exp\left[-\lambda_0\,\rho(k,l) - \lambda_k/P(k)\right] = A(k)\exp\left(-\lambda_0\,\rho(k,l)\right)$

where the A(k)'s are determined as functions of $\lambda_0$ by

(7)  $\sum_k A(k)\exp\left(-\lambda_0\,\rho(k,l)\right) = 1$   (l = 1, 2, ...)

under the restriction that A(k) = 0 if P(k) = 0. $\lambda_0$ is adjusted to
satisfy (2.1).

Note that (7) determines A(k) as the solution of a non-homogeneous
system of linear algebraic equations. Unfortunately, D cannot be evaluated
directly from a knowledge of P(k) and $P_l(k)$. It is first necessary to
evaluate some one of the quantities P(i,j), $Q_i(j)$, or Q(j), and this
requires the solution of a system of linear algebraic equations.

This is the reason why the expression (6) has only limited practical
value for evaluating specific information rates of sources.

In the special case where $P(k) \ne 0$ for any $k = 0, \pm 1, \pm 2, \ldots$ and
$\rho(i,j) = \rho(i-j)$ the solution of (7) is

(8)  A(k) = indep. of k = $\alpha(\lambda_0)$, making the a-posteriori probability that
i was transmitted an exponentially decaying function of the error metric
$\rho(i-j)$:

(9)  $P_j(i) = \alpha(\lambda_0)\exp\left(-\lambda_0\,\rho(i-j)\right)$.
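The exponential form (9) can be checked on the smallest possible case by brute force rather than by solving (7). The Python sketch below assumes a binary source with P(0) = P(1) = 1/2 and the punishment rho(i, j) = 0 if i = j, else 1; these choices, the grid search, and the function names are illustrative assumptions, not part of the report.

```python
from math import log

def receipt_rate(p_emit, trans):
    """R = sum P(i) Q_i(j) ln [Q_i(j) / Q(j)] for a discrete link, in nats."""
    n_out = len(trans[0])
    q = [sum(p_emit[i] * trans[i][j] for i in range(len(p_emit)))
         for j in range(n_out)]
    total = 0.0
    for i, pi in enumerate(p_emit):
        for j in range(n_out):
            pij = pi * trans[i][j]
            if pij > 0:
                total += pij * log(trans[i][j] / q[j])
    return total

def source_rate_binary(v, steps=2000):
    """Scan all binary channels with average punishment v; return (min R, a at min).

    With a = Q_0(1) and b = Q_1(0), constraint (2.1) reads (a + b) / 2 = v,
    so b = 2v - a and only a has to be scanned (requires v <= 1/2).
    """
    best = (float("inf"), None)
    for n in range(steps + 1):
        a = 2 * v * n / steps
        b = 2 * v - a
        rate = receipt_rate([0.5, 0.5], [[1 - a, a], [b, 1 - b]])
        if rate < best[0]:
            best = (rate, a)
    return best

v = 0.1
min_rate, a_at_min = source_rate_binary(v)
# Closed form expected from (9): P_j(i) = 1 - v for i = j and v otherwise,
# i.e. alpha = 1 - v and lambda_0 = ln((1 - v) / v).
closed_form = log(2) + v * log(v) + (1 - v) * log(1 - v)
```

In this assumed case the scan settles on the symmetric channel a = b = v, for which the a-posteriori probabilities take exactly the two values 1 - v and v, in agreement with the exponential law (9).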

Solutions  for  the  continuous  case  are  obtained  by  replacing  the 
probabilities  in  the  above  formulas  by  the  corresponding  distributions, 
and  the  sigma  signs  by  integrals.  This  transforms  the  linear  algebraic 
equations  into  integral  equations. 
4. Channel Capacity

In the theory of information the ability of a channel to transmit
information produced by a source to the receptor is described by a quantity
known as channel capacity. The concept of the channel is needed to take
into account the fact that the symbols emitted by the source are not
necessarily the symbols arriving at the receiver. Loosely speaking,
therefore, the channel is that part of a 2-point one-way communication
system where the noise occurs.

Since the physical nature of a transmission link often limits the
number of symbols per second that can be transmitted
through it, channel capacity will be defined on a per-unit time, instead of
a per-unit symbol basis. Let M be the number of symbols per second, and
let $q_x(y)$ be the transition probability distribution describing the noise;
then the channel will be operating at its "capacity" C when the source is
properly "matched" to the channel:

$C = \max_{p(x)} M \iint p(x,y)\ln\frac{p(x,y)}{p(x)\,q(y)}\,dx\,dy$

The right side of the above equation will be maximized for some distribution
p(x). The channel will be able to transmit the maximum amount of information
per second if it is fed by a source governed by the distribution p(x). This
concept is valuable because it is always possible to code the output of a
source to give the encoded symbols an arbitrary given distribution (8). It
should be noted that under certain conditions it may be desirable to carry
out the maximization over only a restricted class of permissible p(x)'s (9). In
that case the channel capacity is relative to the permissible set of input
symbols.
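A minimal Python sketch of this maximization for a discrete channel follows. It assumes a channel carrying only 0's and 1's at M = 25 symbols per second with a made-up error probability of 0.1 (an illustrative value only), and it replaces the analytic maximization by a plain scan over input distributions; the function names are hypothetical.

```python
from math import log

def receipt_rate(p_emit, trans):
    """R = sum P(i) Q_i(j) ln [Q_i(j) / Q(j)] per symbol, in nats."""
    n_out = len(trans[0])
    q = [sum(p_emit[i] * trans[i][j] for i in range(len(p_emit)))
         for j in range(n_out)]
    total = 0.0
    for i, pi in enumerate(p_emit):
        for j in range(n_out):
            pij = pi * trans[i][j]
            if pij > 0:
                total += pij * log(trans[i][j] / q[j])
    return total

def capacity_per_second(trans, symbols_per_second, steps=1000):
    """Maximise M * R over binary input distributions p = (1 - t, t)."""
    best_rate, best_t = -1.0, None
    for n in range(steps + 1):
        t = n / steps
        rate = receipt_rate([1 - t, t], trans)
        if rate > best_rate:
            best_rate, best_t = rate, t
    return symbols_per_second * best_rate, best_t

eps = 0.1                                   # assumed error probability
channel = [[1 - eps, eps], [eps, 1 - eps]]  # Q_i(j) for a symmetric channel
C, t_matched = capacity_per_second(channel, symbols_per_second=25)
```

For this symmetric channel the scan picks the equiprobable input, which is what "matching" the source to the channel means here; C is returned in nats per second and can be divided by ln 2 to express it in binary units.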

(8)  Details  will  be  given  in  a  later  section. 

(9) For instance we may permit only p(x)'s with a given second moment
(a power limitation).
5. Example: Capacity of a Band-limited Channel with White Noise

The  restriction  of  band  limitation  of  say,  from  0  to  W  cycles  per 
second,  means  that  the  spectra  of  both  the  function  emitted  by  the  source 
and  the  noise  are  limited  to  the  interval  (0,W).  Such  functions  can  be 
written  in  the  form 


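(1)  $f(t) = \sum_k f(k/2W)\,\varphi(t - k/2W)$

where $\varphi(t) = \frac{\sin 2\pi W t}{2\pi W t}$.

Since $\int_{-\infty}^{\infty} \varphi(t - m/2W)\,\varphi(t - n/2W)\,dt = \delta_{mn}/2W$ for integral m and n,

(2)  power of f(t) $= \lim_{T\to\infty} \frac{1}{2T}\int_{-T}^{T} f^2(t)\,dt = \frac{1}{2W}\lim_{T\to\infty} \frac{1}{2T}\sum_{k=-2WT}^{2WT} f^2(k/2W) =$

$= \lim_{n\to\infty} \frac{1}{2n}\sum_{k=-n}^{n} f^2(k/2W) = \overline{f^2(k/2W)}$.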

From (1) we see that f(t) can be thought of as produced by a source
that emits a pulse shape $\varphi$ with amplitude $x_k = f(k/2W)$ at instants of time
1/2W seconds apart. If the $x_k$ are picked from a distribution p(x) then (2)
indicates that the power of f(t) will be the second moment of p:

(3)  power of f(t) $= \int_{-\infty}^{\infty} x^2\,p(x)\,dx$

(10) This formula can be obtained by expanding the spectrum of f(t) in
a Fourier series, and then using the Fourier integral representation
for f(t).

A representation of band-limited white noise of power N can be obtained
by means of the concept that it results when a large number of correspondingly
band-limited functions are added at random. Let g(t) represent the noise,
and $f_i(t)$ typify the functions that add to produce the noise; evidently
$g(k/2W) = \sum_i f_i(k/2W)$. By the central limit theorem $x_k = g(k/2W)$ will have a
Gaussian distribution, which by (3) must have a second moment equal to N:

(4)  $r(x) = \frac{1}{\sqrt{2\pi N}}\exp\left(-x^2/2N\right)$ = distribution of g.

$x_k$'s corresponding to two different values of k are independent.

In the preceding paragraphs we have spoken of f(k/2W) as the coefficient
of the elementary pulse shapes that make up the signal. It is apparent that
the pulse shapes themselves merely act as carriers. A model for the entire
process is obtained if we consider the source to emit a sequence of real
numbers picked from a distribution, say p(x), with second moment S (the
power of the source). These real numbers are the "symbols" produced by
the source, the symbol-producing rate being

(5)  M = 2W symbols per second.

The effect of the noise is to add a second sequence term by term to the
source sequence, with the terms of the second sequence picked at random
from the distribution (4).

Due to the additive nature of the noise

(6)  $p(x,y) = p(x)\,r(y-x)$. Therefore

(7)  $R = -\int q(y)\ln q(y)\,dy + \iint p(x,y)\ln q_x(y)\,dx\,dy = -\int q(y)\ln q(y)\,dy + \int r(z)\ln r(z)\,dz$.

By (4)

(8)  $\int r(z)\ln r(z)\,dz = -\tfrac{1}{2}\ln(2\pi e N)$.

The problem now is to maximize $-\int q(y)\ln q(y)\,dy$ over all p(x). Since the total power at the receiver is S+N the second moments of p(x) and q(y) must be fixed at S and S+N respectively. If the maximization of (7) were to be carried out over all possible q(y) instead of p(x) (as it actually must be) we could use the easily proved theorem that

(9)  $\max_{q(y)} -\int q(y)\ln q(y)\,dy$ with $\int y^2 q(y)\,dy$ = fixed is obtained when q(y) is Gaussian; i.e., $\max_{q(y)} -\int q(y)\ln q(y)\,dy = \tfrac{1}{2}\ln 2\pi e(S+N)$.

It is, however, certainly true in view of the preceding that

(10)  $\max_{p(x)} -\int q(y)\ln q(y)\,dy$ with $\int x^2 p(x)\,dx = S$  $\le$  value obtained when q(y) were Gaussian $= \tfrac{1}{2}\ln 2\pi e(S+N)$.

Now from the equation

(11)  $q(y) = \int p(x,y)\,dx = \int p(x)\,r(y-x)\,dx$

and the fact that r(z) is Gaussian it happens to follow fortuitously that it is possible to make q(y) Gaussian by taking p(x) Gaussian:

(12)  $p(x) = (1/\sqrt{2\pi S})\exp(-x^2/2S)$.

Therefore the inequality of (10) becomes an equality, and we have, combining (5), (7), (8), (10)

(13)  $C = W\log\dfrac{S+N}{N}$

as the capacity of the model channel. But the model channel was obtained from the real channel by a relabeling process, namely by relabeling sequences of pulses as sequences of real numbers. Since (7) was derived under the postulate that it is invariant under relabeling$^{(11)}$, (13) is also the capacity of the real channel. According to (12) the channel rate is maximized when the source emits white noise.

(11) In the derivation of R it was actually only postulated that invariance held if real numbers were relabeled as other real numbers, and only one-dimensional distributions were considered. If the distributions had been taken multi-dimensional the above statement would have followed rigorously.
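The way (5), (7), (8) and (10) combine into (13) can be followed numerically. This is a sketch under stated assumptions: the bandwidth and powers are hypothetical, natural logarithms are used throughout, and the helper names are illustrative only.

```python
import math

def gaussian_entropy(second_moment):
    """Value of -integral q ln q for a zero-mean Gaussian with this second moment."""
    return 0.5 * math.log(2.0 * math.pi * math.e * second_moment)

def capacity(W, S, N):
    """2W symbols per second times the maximum rate per symbol from (7), (8), (10)."""
    rate_per_symbol = gaussian_entropy(S + N) - gaussian_entropy(N)
    return 2.0 * W * rate_per_symbol

W, S, N = 3000.0, 10.0, 1.0      # hypothetical bandwidth (cps) and powers
C = capacity(W, S, N)            # natural units per second
closed_form = W * math.log((S + N) / N)
```

The two numbers agree because the terms in $2\pi e$ cancel in the difference of the two Gaussian entropies.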
IV.  Properties of Information Rates.

1.  Summary

The various information rates are expressed in terms of the entropy and conditional-entropy functions, which are defined and studied. It is shown that the number of highly probable long sequences of symbols emitted by a source is closely related to the information rate of the source.

In the last paragraph the fundamental theorem for 2-point, 1-way communication is derived. This states that with proper encoding and decoding equipment the output of a source can always be transmitted in the presence of noise, without error, at a rate determined by the channel capacity and the information rate of the source.

2.  Entropy  Functions 

Let the entropy G of a distribution function f(x) be defined as

(1)  $G = -\int f(x)\log f(x)\,dx$.

Therefore the entropy of the source is

(2)  $G(S) = -\int p(x)\log p(x)\,dx$, where S stands for Source,

and the entropy of the received symbols is

(3)  $G(T) = -\int q(x)\log q(x)\,dx$, where T stands for Receiver.

We also define the mixed or relative entropies

(4)  $G_T(S) = -\iint p(x,y)\log p_y(x)\,dx\,dy$ and

(5)  $G_S(T) = -\iint p(x,y)\log q_x(y)\,dx\,dy$.

(4) is spoken of as the "entropy of S knowing T", and (5) the "entropy of T knowing S". By thinking of the pair (x,y) as one symbol, we can extend (1) to cover the concept of joint entropy:

(6)  $G(T,S) = -\iint p(x,y)\log p(x,y)\,dx\,dy$.

It can easily be shown that

(7)  $G(T,S) = G(T) + G_T(S) = G(S) + G_S(T) = G(S,T)$.

It also follows from (4) and (5) that if x and y are independent then

(8)  $G_S(T) = G(T)$ and $G_T(S) = G(S)$.

Thus if T and S are independent

(9)  $G(T,S) = G(T) + G(S)$.

For the discrete case it is desirable to introduce analogous quantities

(10)  $H(S) = -\sum_i P_i \log P_i$

(11)  $H(T) = -\sum_j Q_j \log Q_j$

(12)  $H_S(T) = -\sum_{i,j} P(i,j)\log Q_i(j)$

(13)  $H_T(S) = -\sum_{i,j} P(i,j)\log P_j(i)$

(14)  $H(T,S) = H(S,T) = H_T(S) + H(T) = H(S) + H_S(T) = -\sum_{i,j} P(i,j)\log P(i,j)$.
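The discrete definitions (10) to (14) are easy to evaluate for a small joint distribution. The sketch below uses a hypothetical 2 by 2 table of P(i,j) and natural logarithms; the variable names are illustrative and not taken from the report.

```python
import math

def H(probs):
    """Entropy -sum p ln p of a list of probabilities (0 ln 0 taken as 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical joint probabilities P(i,j): rows are source symbols i,
# columns are received symbols j.
P_joint = [[0.30, 0.10],
           [0.05, 0.55]]

P_i = [sum(row) for row in P_joint]                  # P(i)
Q_j = [sum(col) for col in zip(*P_joint)]            # Q(j)

H_S = H(P_i)                                         # (10)
H_T = H(Q_j)                                         # (11)
H_joint = H([p for row in P_joint for p in row])     # last member of (14)

# (12): H_S(T) = -sum P(i,j) ln Q_i(j), with Q_i(j) = P(i,j) / P(i)
H_T_knowing_S = -sum(p * math.log(p / P_i[i])
                     for i, row in enumerate(P_joint) for p in row if p > 0)
# (13): H_T(S) = -sum P(i,j) ln P_j(i), with P_j(i) = P(i,j) / Q(j)
H_S_knowing_T = -sum(p * math.log(p / Q_j[j])
                     for row in P_joint for j, p in enumerate(row) if p > 0)
```

Both decompositions in (14) reproduce the joint entropy for this table, which is the identity the text states.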
It is possible to express information rate in terms of the quantities defined above. The expression is in the continuous case

(15)  $R = G(S) - G_T(S) = G(T) - G_S(T)$.

According to III, 5, (7) when the noise symbols are "additive" and independent of the source symbols (15) becomes

(16)  $R = G(T) - G(N)$

where $G(N) = -\int r(x)\log r(x)\,dx$ is the entropy of the noise.

For the discrete case (15) degenerates into

(17)  $R = H(S) - H_T(S) = H(T) - H_S(T)$.$^{(12)}$

The reader may have noticed that G(S) is actually the uncertainty function U(p) arrived at in III, 2, (24). (III, 2, (25) shows that b(v) and c(v) appearing in III, 2, (24) are irrelevant.) In other words (1) is a measure of the uncertainty associated with the distribution f(x).$^{(13)}$ More generally, for instance, $G_T(S)$ is the uncertainty of the symbol at S, knowing the symbol at T. With this interpretation we can easily "derive" relation (16). One need merely note that $G_S(T)$, the uncertainty of what was received, knowing what was emitted, is, in the case of independent additive noise, the uncertainty of received signal plus noise with the emitted signal known, this being simply the uncertainty of the noise, G(N). Substituting $G_S(T) = G(N)$ into (15) yields (16).

(12) This is true even though none of the G's individually degenerate into the corresponding H's. A G can be thought of as differing from the corresponding H by an infinite additive constant, these constants cancelling out when the difference of two G's or H's is taken.

(13) In the discrete case $H = -\sum_i f_i \log f_i$ is a measure of the uncertainty associated with the probabilities $(f_1, f_2, \ldots, f_n)$ in quite an absolute sense. It can be shown that H will be a maximum when all the f's are equal, and it is obvious that H is zero if and only if one of the f's is unity and all others vanish.

3.  Laws  of  Long  Sequences 

This  paragraph  lists  some  properties  of  long  sequences  of  output 
symbols  from  a  discrete  source,  transmitted  over  a  noisy  channel. 

Law  I: 

Every emitted sequence of length L>>1 symbols has w.h.p.$^{(14)}$ $\exp(H_S(T)L)$ received sequences of length L as possible consequences.

(14) The phrases "with high probability" (w.h.p.) and "with probability zero" (w.p.z.) are to be interpreted as meaning that the probabilities referred to approach 1 and 0 respectively as $L \to \infty$. Sometimes when elements of a set V are w.h.p. also in the set W, we will say "All elements of V are in W".

Proof:

If the sequence is emitted it will w.h.p. contain the i'th symbol P(i)L times (i = 1,2,...,n where n is the number of possible symbols). The emitted message can therefore be considered to consist of n (possibly interlaced) blocks of P(i)L symbols each. Each such i'th block will produce a block of P(i)L received symbols, containing the j'th symbol $Q_i(j)P(i)L = P(i,j)L$ times. The probability of a particular block of received symbols is therefore w.h.p.

$\prod_{j} [Q_i(j)]^{P(i,j)L}$.

The probability of the entire received sequence is therefore

$\prod_{i}\prod_{j} [Q_i(j)]^{P(i,j)L} = \exp(-H_S(T)L)$.

The desired result now follows because each of the h.p. received messages is equally likely.
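The central step of the proof, that a long received sequence has conditional probability close to $\exp(-H_S(T)L)$, can be illustrated by simulation. The sketch below is only an illustration: the source statistics, the noise statistics, the length L and the random seed are all hypothetical choices, and the function names are not from the report.

```python
import math
import random

def conditional_entropy(P, Q):
    """H_S(T) = -sum_i sum_j P(i) Q_i(j) ln Q_i(j), in natural units."""
    return -sum(P[i] * Q[i][j] * math.log(Q[i][j])
                for i in range(len(P)) for j in range(len(Q[i])) if Q[i][j] > 0)

def sample(dist, rng):
    """Draw an index from a discrete distribution by inverting its cumulative sum."""
    u, acc = rng.random(), 0.0
    for k, p in enumerate(dist):
        acc += p
        if u < acc:
            return k
    return len(dist) - 1

def log_prob_rate(P, Q, L, rng):
    """Emit L symbols from P, pass each through Q, and return
    -(1/L) ln Pr(received sequence | emitted sequence)."""
    total = 0.0
    for _ in range(L):
        i = sample(P, rng)
        j = sample(Q[i], rng)
        total -= math.log(Q[i][j])
    return total / L

P = [0.5, 0.5]                          # hypothetical source statistics P(i)
Q = [[0.75, 0.25], [0.25, 0.75]]        # hypothetical noise statistics Q_i(j)
rate = log_prob_rate(P, Q, 20000, random.Random(0))
```

For large L the value of `rate` settles near `conditional_entropy(P, Q)`, which is the statement that the received sequence has probability about $\exp(-H_S(T)L)$.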

Corollary I: (The dual$^{(15)}$ of Law I)

Corollary II: The number of h.p. emitted sequences of length L is exp(H(S)L).

Corollary III: (The dual of Corollary II)

Fig. 3  Transmission of Sequences of L>>1 Symbols Over Noisy Channel

Figure 3 illustrates the situation occurring when long sequences are transmitted over a noisy channel. Received and transmitted sequences are represented as points on the right and left respectively. Each fan shows how many received sequences a given emitted sequence can result in.

(15) The dual is obtained by interchanging the words "source" and "receiver", the symbols i and j, P(i) and Q(j), and $Q_i(j)$ and $P_j(i)$.

Since the fans, in general, overlap, the receiver cannot know exactly what was transmitted. However, if only a few of the possible points on the left were actually used to represent messages it is conceivable that the resulting fans might not overlap. A necessary condition for this to occur is certainly that no more than

$\exp(H(T)L)/\exp(H_S(T)L) = \exp[(H(T)-H_S(T))L]$

of the points on the left are used.

Law II:

If less than $\exp[(H(T)-H_S(T)-\delta)L]$ ($\delta > 0$) points are selected at random from the source side of the channel the resulting fans will overlap w.p.z.$^{(16)}$

(16) This is of course a much weaker theorem than one giving specific instructions as to how to pick the points on the left to get the minimum possible overlap. Stronger theorems have been obtained for specific channels. See Refs. 2, 11, 20.

Proof:

Suppose $\exp[(H(T)-H_S(T)-\delta)L]$ points are selected at random from the left, making the probability that a particular point is a selected point

$\exp[(H(T)-H_S(T)-\delta)L]/\exp(H(S)L) = \exp(-H_T(S)L-\delta L)$.

No two fans emanating from selected points will overlap if any given point on the right cannot be "caused" by more than one selected point.

Each point on the right can a-priori (i.e. if no selection of points on the left were used) be caused by $\exp(H_T(S)L)$ left points. The probability P that at least two of these points are selected points is less than 1-A, where A is the probability that none of the $\exp(H_T(S)L)$ points is a selected point.

$A = [1-\exp(-H_T(S)L-\delta L)]^{\exp(H_T(S)L)} \to 1$ as $L \to \infty$.

Therefore $P \to 0$ as $L \to \infty$; q.e.d.
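The limit used in the proof can be tabulated directly. In this sketch the values standing for $H_T(S)$ and $\delta$ are hypothetical, and the function name is illustrative. The exponent is handled through `math.log1p` so that the very small selection probability is not lost to rounding.

```python
import math

def prob_none_selected(h, delta, L):
    """A = [1 - exp(-(h + delta) L)] ** exp(h L), where h stands for H_T(S).
    Computed as exp(exp(h L) * ln(1 - exp(-(h + delta) L)))."""
    n_points = math.exp(h * L)
    return math.exp(n_points * math.log1p(-math.exp(-(h + delta) * L)))

h, delta = 0.5, 0.1                      # hypothetical H_T(S) and delta
values = [prob_none_selected(h, delta, L) for L in (10, 50, 200)]
```

Since ln A is approximately $-\exp(-\delta L)$, the entries of `values` rise toward 1 as L grows, which is the behaviour the proof relies on.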

Corollary IV: (The dual of Law II)

Corollary V:

If $\exp[(H(T)-H_S(T)+\delta)L]$ points ($\delta > 0$) are selected at random from the receiver side of a channel the fans$^{(17)}$ emanating from them cover w.h.p. all the exp(H(S)L) points at the source side.

Proof:

By Corollary IV if $\exp[(H(T)-H_S(T)-\delta)L]$ points were selected at the right there would be no overlapping of fans so that $\exp[(H(S)-\delta)L]$ of the exp(H(S)L) points on the left are covered. The desired result follows easily.

4.  Fundamental  Theorem  for  Transmission  over  Noisy  Channel 
Theorem  II: 

It is possible to match a source producing R units of information per symbol (relative to a fidelity criterion) to a channel of capacity C by means of coders in such a way that if less than C/R symbols per unit time are transmitted the transmission quality will satisfy the fidelity criterion.

Proof: 

The proof will consist in describing various coders and decoders by means of which it is possible to attain the objective announced. The pertinent block diagram is shown in Figure 4.

(17) These fans originate at the right and spread out toward the left. They are the duals of the ones shown in Fig. 3, and indicate the number of emitted sequences that could have caused the received sequence from which they emanate.

RM-454

9-20-50

-32-


Fig.  4  Block  Diagram  for  Transmission  System  With  Coding  Equipment 

(a)  The  first  quantizer. 

The  FQ  (first  quantizer)  is  not  needed  if  the  source  is  discrete. 
If  the  source  is  not  discrete  the  FQ  is  used  (purely  for  the  sake  of 
mathematical  convenience)  to  quantize  it  into  very  fine  but  discrete 
levels.  It  is  intuitively  obvious  that  very  fine  quantizing  has  no 
appreciable  effect  on  the  rate  of  the  source.  Thus  the  information  rates 
at  u  and  v  are  the  same. 

(b)  The  second  quantizer 

The SQ (second quantizer) is not needed if the fidelity criterion requires perfect transmission. If, on the other hand, it is not dictated that the symbols at v be transmitted with perfect fidelity (i.e. if the rate R at v is not the absolute rate$^{(18)}$ at v) the SQ quantizes the symbols at v in such a way that the quantized symbols put out at w have an absolute rate R. (Therefore from w onward there must be no more distortion in the transmission system.)

Fundamentally, the SQ operates by first ascertaining which of an equivalent number of classes a given sequence v belongs to, and then transmitting a code number for that class; for instance, the code number might simply be the "central" sequence of the particular class. Specifically, these classes and their code symbols can be determined with the help of Corollary V as follows: We consider v the "emitted" symbols and w the "received" symbols. With this notation the rate of information per symbol at v is

$R = \min_{Q_i(j)} \sum_{i,j} P(i,j)\ln\dfrac{P(i,j)}{P(i)Q(j)}$

with $\sum_{i,j} P(i,j)\rho(i,j)$ = const.

Suppose P'(i,j) is the P(i,j) for which the minimum in the above definition of R is obtained. Select, according to the method of Cor. V, $\exp(RL+\delta L)$ points on the "receiver" side of the transmission system obtained with P(i,j) = P'(i,j). The SQ is to be constructed so that it will use a particular selected point as the code for the class of points caught in the fan emanating from the selected point. The SQ obtained by this construction satisfies the fidelity criterion, and has the property that, looking into its output terminal w, we see a source of absolute rate R units of information per symbol.

(18) Cf. III, 3 as reference for this section.

(c)  The  channel  matcher 

The  CM  (channel  matcher)  is,  as  its  name  indicates,  a  device 
for  encoding  the  symbols  arriving  at  w  into  symbols  that  are  best  able 
to  combat  the  noise  present  in  the  channel.  Since  it  must  be  possible 
to  recover  the  symbols  w  with  perfect  accuracy  at  the  receiver,  the  CM 
must  be  a  one-to-one  coder;  that  is,  it  must  be  reversible. 

(19) Cf. III, 4 as reference for this section.

For purposes of discussing the CM consider x to be the "emitted" symbols and y the "received" symbols. Assume that the symbols x are produced according to the distribution P''(i) and transmitted at the rate M' symbols per second, where P''(i) and M' maximize the channel rate; i.e. assume that

$C = \max_{P(i)} M \sum_{i,j} P(i,j)\ln\dfrac{P(i,j)}{P(i)Q(j)}$,

where M is the number of symbols per second, is obtained for P(i) = P''(i) and M = M'. If $H(S'') = -\sum_i P''(i)\log P''(i)$, then, according to Cor. II, if the channel is operated with P(i) = P''(i) there will be exp(H(S'')M'T) possible h.p. long sequences of length T seconds at point x. According to Law II if less than exp(CT) of these sequences are used as messages the "receiver" at y will be able to ascertain exactly which message was sent. The problem for the CM is therefore only to code the symbols arriving at w into the exp(CT) symbols that are available for transmission without error. Since exp(RL) symbols of length L arrive at w such coding will obviously be possible if and only if RL < CT, i.e. if and only if

no. of symbols per sec. produced by source = L/T < C/R,

where R is the rate of the source, and C the capacity of the channel.
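The matching condition L/T < C/R can be made concrete for a small channel. The following is a sketch under stated assumptions: the binary noise statistics, the channel symbol rate M and the source rate R are hypothetical, natural logarithms are used, and the maximum over P(i) is found by a crude grid search rather than by any method given in the report.

```python
import math

def mutual_information(P, Q):
    """sum_{i,j} P(i,j) ln [P(i,j) / (P(i) Q(j))] for input statistics P(i)
    and noise statistics Q[i][j] = Q_i(j)."""
    n_out = len(Q[0])
    q_out = [sum(P[i] * Q[i][j] for i in range(len(P))) for j in range(n_out)]
    total = 0.0
    for i, p_i in enumerate(P):
        for j in range(n_out):
            p_ij = p_i * Q[i][j]
            if p_ij > 0:
                total += p_ij * math.log(Q[i][j] / q_out[j])
    return total

def capacity_binary_input(Q, M, steps=1000):
    """Grid search over P(i) for a two-symbol input, at M symbols per second."""
    return M * max(mutual_information([k / steps, 1 - k / steps], Q)
                   for k in range(steps + 1))

Q = [[0.75, 0.25], [0.25, 0.75]]     # hypothetical: each digit is wrong 1/4 of the time
M = 10.0                             # hypothetical channel symbols per second
C = capacity_binary_input(Q, M)      # natural units per second
R = 0.5                              # hypothetical source rate per symbol
max_source_symbols_per_sec = C / R
```

Any source symbol rate below `max_source_symbols_per_sec` satisfies RL < CT for every message duration T.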

(d)  The  decoder 

The decoder performs the operation inverse to the CM, so that we end up with the same symbols at z that originated at w.
V.  Prediction of Time Series.

1.  Summary 

This section outlines the philosophy behind the prediction problem for time series chosen from an ensemble of time series for which a certain set of multi-dimensional probability functions exists and is known a-priori.

2.  Multi-dimensional Probability Distributions

(1)  Let $z_1, z_2, \ldots, z_i, \ldots$ be a typical time series of an ensemble of time series.

(2)  Let $V_k(y_1,y_2,\ldots,y_k)\,dy_1\,dy_2 \ldots dy_k$  (k = 1,2,...)

be the probability that if a block of k consecutive z's, beginning with $z_{j+1}$, is selected at random from (1) the z's will lie in the region

$y_i \le z_{i+j} \le y_i + dy_i$  (i = 1,2,...,k),

relation (2) being postulated to hold independent of j, and independent of which particular time series is chosen from the ensemble.

(3)  Let $W_k(y_1,y_2,\ldots,y_k;y_{k+1})\,dy_{k+1}$  ($k \ge 1$)

be the probability that $z_{k+j+1}$ will lie in the region $y_{k+1} \le z_{k+j+1} \le y_{k+1} + dy_{k+1}$ when it is known that $z_{i+j} = y_i$ (i = 1,2,...,k).

If we arbitrarily set

(4)  $W_0(y) = V_1(y)$

it follows that the V and W functions are related through

(5)  $V_k(y_1,y_2,\ldots,y_k) = V_{k-1}(y_1,y_2,\ldots,y_{k-1})\,W_{k-1}(y_1,y_2,\ldots,y_{k-1};y_k)$  if $k \ge 2$.
To obtain a complete statistical description of the stochastic process in question all the $W_k$'s (or what is easier experimentally, all the $V_k$'s) must be found. In most practical cases there will be no "influence" extending further than, say, j signals. This simply means that

(6)  $W_k(y_1,y_2,\ldots,y_k;y_{k+1}) = F(y_{k-j+1},y_{k-j+2},\ldots,y_k;y_{k+1})$

for k larger than some sufficiently large j.
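The remark that the $V_k$'s are the easier quantities to obtain experimentally can be illustrated for a series with a finite alphabet. The sketch below makes several assumptions not in the report: the series is discrete rather than continuous, it is a short hypothetical sample, and it is read cyclically so that block counts of different lengths stay consistent. The W's are then obtained from the V's through relation (5).

```python
from collections import Counter

def block_probabilities(series, k):
    """Empirical V_k: relative frequency of each block of k consecutive symbols.
    The series is read cyclically so that every start position is used."""
    n = len(series)
    counts = Counter(tuple(series[(j + i) % n] for i in range(k)) for j in range(n))
    return {block: c / n for block, c in counts.items()}

def conditional_probabilities(series, k):
    """Empirical W_k(y_1..y_k; y_{k+1}) obtained from (5) as V_{k+1} / V_k."""
    V_k = block_probabilities(series, k)
    V_k1 = block_probabilities(series, k + 1)
    return {block: p / V_k[block[:k]] for block, p in V_k1.items()}

series = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]    # hypothetical discrete time series
V_1 = block_probabilities(series, 1)
W_1 = conditional_probabilities(series, 1)
```

Each key of `W_1` is a pair (previous symbol, next symbol), and for a fixed previous symbol the entries sum to one, as a conditional distribution must.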

3.  Predictability 

Loosely  speaking,  the  more  redundant  a  time  series  is, 
i.e.  the  less  uncertainty  there  is  about  the  next  signal,  knowing  a 
certain  number  of  previous  signals,  the  more  easily  predictable  will 
the  time  series  be.  Some  of  the  terms  used  in  the  preceding  sentence 
can  be  defined  exactly. 

(a) k-derived uncertainty $H_k$

Analogously to III, 3 let $\rho(x,z)$ measure the punishment meted out if a signal x is predicted to be the symbol z, and let v measure the amount by which two signals must differ in order to become practically distinguishable.

Define $R_k(y)$ to be the rate of a mathematically artificial source that produces symbols x independently according to the distribution $p(x) = W_k(y;x)$ relative to the criterion $\iint p(x,z)\rho(x,z)\,dx\,dz = v$; i.e.

(8)  $R_k(y) = \min_{q_x(z)} \iint p(x,z)\log\dfrac{p(x,z)}{p(x)q(z)}\,dx\,dz$  with  $\iint p(x,z)\rho(x,z)\,dx\,dz = v$,

where $p(x,z) = p(x)q_x(z)$, $q(z) = \int p(x,z)\,dx$.

Let $H_k$ be the average of $R_k(y)$ over the possible $y_1, y_2, \ldots, y_k$:

(9)  $H_k = \int R_k(y)V_k(y)\,dy$  (a k-fold integral)

$H_k$ is evidently the average amount of information needed to specify a signal if the previous k signals are known. Thus it is a measure of the uncertainty with which we know what a signal will be if we know the previous k signals.

(b)  redundancy 

(10)  Let $H_\infty = \lim_{k\to\infty} H_k$.

The redundancy of the time series can be defined as

(11)  $\mu = 1 - H_\infty/H_0$.

If successive symbols are independent we will have

(12)  $W_k(y;x) = W_0(x)$ (all k), and therefore

(13)  $H_\infty = H_0$, so that $\mu = 0$.

If the next signal is, on the other hand, completely determined once a sufficient number of preceding signals are known, $\mu = 1$.

It should not be forgotten that, in general, the redundancy is relative to the punishment function $\rho(x,y)$ and the distinguishability criterion v.
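The definition (11) can be tried on a simple discrete series. The sketch below rests on assumptions that go beyond the text: the series has two symbols, the criterion requires perfect distinguishability so that each $R_k(y)$ reduces to the entropy of $W_k(y;x)$, and influence extends over one signal only so that $H_\infty = H_1$. The transition tables are hypothetical, and the function does not handle a chain that never changes state.

```python
import math

def entropy(probs):
    """-sum p ln p, with 0 ln 0 taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def redundancy_two_state(T):
    """mu = 1 - H_inf / H_0 for a two-symbol series whose rows T[a] give W_1(a; x)."""
    p01, p10 = T[0][1], T[1][0]
    stationary = [p10 / (p01 + p10), p01 / (p01 + p10)]          # V_1
    H_0 = entropy(stationary)
    H_inf = sum(stationary[a] * entropy(T[a]) for a in range(2))  # average of R_1(y)
    return 1.0 - H_inf / H_0

mu_independent = redundancy_two_state([[0.5, 0.5], [0.5, 0.5]])
mu_sticky = redundancy_two_state([[0.9, 0.1], [0.1, 0.9]])
mu_determined = redundancy_two_state([[0.0, 1.0], [1.0, 0.0]])
```

The three cases reproduce the two limits named in the text, $\mu = 0$ for independent symbols and $\mu = 1$ for a completely determined series, with the intermediate case falling between them.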

4.  The  Mechanism  of  Prediction 

(a) choice of the punishment function $\rho(x,y)$

In order to design a predictor, it is, in principle, necessary to first specify the function of two variables $\rho(x,y)$ that measures the punishment meted out if the next signal is predicted to be "y" but actually turns out to be "x". Although the choice of $\rho(x,y)$ will be dictated by the application of the predictor, its selection is ultimately a psychological problem.

The predictor is designed so as to minimize the expected value of $\rho(x,y)$.$^{(20)}$

A common choice for $\rho(x,y)$ is

(13)  $\rho(x,y) = f(x-y)$,

in which case the punishment depends only on the error. For instance the least-squares criterion, popular for reasons of analytic simplicity,

(14)  $\rho(x,y) = f(x-y) = (x-y)^2$

is of this type.

(b) unrestricted versus restricted prediction

The most general form of predictor is a computer which, on the basis of all information at hand, predicts a signal so as to minimize the expected value of the punishment. According to the foregoing formulation of the prediction problem the computer can do this if it remembers all previous signals, computes the a-priori distribution of the signal to be predicted according to the $W_k$ functions, and then minimizes the expected value of $\rho$. A process such as this can be called "unrestricted" prediction.

(20) This might be called a "rational" prediction criterion. It is conceivable (in fact the motivation for gambling) to have the punishment function dependent not merely on x and y, but also on the probability that x will occur. Maximizing the expected value of $\rho$ in such a case would amount to an "irrational" criterion. With irrational criteria it may be desirable for the predictor to play a mixed strategy against the time series, i.e. to "toss a coin". With rational criteria it is pointless to play a mixed strategy.

On the other hand, consider the case where, for practical reasons, it is necessary to place theoretically artificial restrictions on the storage mechanism and permissible operations assigned to the computer. When this situation arises we speak of "restricted" prediction. An example is the case of so-called linear prediction where the computer is permitted to evaluate only linear combinations (with permanently fixed coefficients) of amplitudes of past signals. Although a time series of redundancy 1 is perfectly predictable in the unrestricted sense it may not be so in the restricted sense.

The more restricted a predictor is the larger the error of prediction will be. On the other hand the predictor may be applicable to a larger ensemble of time series if it is more restricted. Thus restriction of predictors has among other things the effect of trading error for versatility.
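The dependence of the prediction on the punishment function can be seen in a small numerical case. This sketch assumes a discrete a-priori distribution for the next signal and a finite grid of candidate predictions, both hypothetical; the function name is illustrative. It compares the least-squares criterion (14) with an absolute-error punishment of the form (13).

```python
def best_prediction(values, probs, punishment, candidates):
    """Return the candidate y that minimizes sum over x of p(x) * punishment(x, y)."""
    def expected(y):
        return sum(p * punishment(x, y) for x, p in zip(values, probs))
    return min(candidates, key=expected)

values = [0.0, 1.0, 4.0]           # hypothetical amplitudes of the next signal
probs = [0.2, 0.5, 0.3]            # hypothetical a-priori distribution
candidates = [k / 100 for k in range(401)]

least_squares = best_prediction(values, probs, lambda x, y: (x - y) ** 2, candidates)
absolute = best_prediction(values, probs, lambda x, y: abs(x - y), candidates)
```

With the least-squares punishment the search lands on the mean of the distribution, while the absolute-error punishment selects the median, so the same a-priori distribution yields different predictions under different choices of $\rho$.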

5.  Examples 

(a)  sine  wave  samples 

Consider a source producing signals at discrete time instants (i = 1,2,...) according to the recursion formula

(15)  $f(z_i) = \sin i$  (i = 1,2,...)

It can be shown that the points i mod $(2\pi)$ cover the interval $(0,2\pi)$ in an everywhere dense fashion and in such a way that the probability that i mod $(2\pi)$ is between two real numbers exists and is flat over $(0,2\pi)$.$^{(21)}$ Therefore the distribution $W_0(z)$ for $f(z_i)$ is the same as that obtained for sin t if t is picked at random from a distribution flat over $(0,2\pi)$. This latter is

(16)  $W_0(z) = \dfrac{1}{\pi\sqrt{1-z^2}}$ if $|z| < 1$;  $W_0(z) = 0$ if $|z| \ge 1$.

(17)  Let $a_k = \sin k$, $b_k = \cos k$.

Then if a given signal has the amplitude $z_j$ it is equally likely that the next signal will have the amplitude

(18)  $b_1 z_j + a_1\sqrt{1-z_j^2}$  or  $b_1 z_j - a_1\sqrt{1-z_j^2}$. Thus

(19)  $W_1(y_1;y_2) = \tfrac{1}{2}\delta\left[y_2 - \left(b_1 y_1 + a_1\sqrt{1-y_1^2}\right)\right] + \tfrac{1}{2}\delta\left[y_2 - \left(b_1 y_1 - a_1\sqrt{1-y_1^2}\right)\right]$.

If two or more consecutive samples are known all future samples can be predicted perfectly because $f(z_i)$ satisfies a difference equation of the second order. The distributions are

(20)  $W_k(y_1,y_2,\ldots,y_k;y_{k+1}) = \delta\left[y_{k+1} - (a_k y_2 - a_{k-1} y_1)/a_1\right]$  ($k \ge 2$).
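The perfect predictability claimed for two or more known samples can be verified numerically. The sketch below takes the i'th signal to be sin i, which is an assumed reading of (15), and uses the second-order difference equation $z_{i+1} = 2\cos(1)\,z_i - z_{i-1}$, which is a standard trigonometric identity rather than a formula quoted from the report. The starting instant is arbitrary.

```python
import math

def a(k):
    """a_k = sin k, as in (17)."""
    return math.sin(k)

B1 = math.cos(1.0)       # b_1 = cos 1, as in (17)

def predict_from_two(y1, y2, k):
    """Location of the delta function in (20): the value of y_{k+1}
    given the first two samples y1, y2 of the block."""
    return (a(k) * y2 - a(k - 1) * y1) / a(1)

def linear_step(z_prev, z_curr):
    """One step of the difference equation z_{i+1} = 2 cos(1) z_i - z_{i-1}."""
    return 2.0 * B1 * z_curr - z_prev

j = 7                                            # arbitrary starting instant
y1, y2 = math.sin(j + 1), math.sin(j + 2)
errors = [abs(predict_from_two(y1, y2, k) - math.sin(j + k + 1)) for k in range(2, 20)]
step_error = abs(linear_step(y1, y2) - math.sin(j + 3))
```

Because `linear_step` uses only a fixed linear combination of past amplitudes, this series is also perfectly predictable in the restricted, linear sense once two samples are known.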


(b)  redundancy  of  English 

According to an estimate given by Shannon$^{(22)}$ the redundancy $\mu$ of written English relative to a criterion requiring perfect distinguishability of different letters is $\mu = 0.5$. This figure probably neglects long-term context.

(21) Cf., for example, ref. 21

(22) Ref. 22
VI  APPENDIX

In I, 2 the limitations of information theory were illustrated by three examples.$^{(24)}$ It will now be shown how the statements made there follow more specifically from the theory presented in the body of the report.

(a) The problem is to construct the FQ, SQ, and CM of Figure 4. Assume that the output has, say, a flat distribution over (0,1), i.e.

(1)  $P(u) = 1$ for $0 < u \le 1$, and $P(u) = 0$ otherwise.

Imagine the FQ to convert u to a finely quantized form, say

(2)  $P(v_i) = 10^{-10}\,i$  $(i = 0, 1, 2, \ldots, 10^{10})$

Evaluate

$R = \min_{Q_{v_i}(w_j)} \sum_{i,j} P(v_i,w_j)\log\dfrac{P(v_i,w_j)}{P(v_i)Q(w_j)}$

with $Q_{v_i}(w_j) = 0$ if $|v_i - w_j| > 5 \times 10^{\ldots}$, and $P(v_i)$ as defined by (2).

Let the minimum be achieved, say, for $Q_{v_i}(w_j) = Q'_{v_i}(w_j)$. In order to build (on paper) a proper SQ, consider a system with input statistic $P(v_i)$, and noise conditions described by $Q'_{v_i}(w_j)$. The SQ should be designed to operate on long messages, say 100 seconds (= 1000 symbols) long. If we arbitrarily pick $\exp(10^3 R)$ of the possible high-probability received messages of the system whose transfer statistic is $Q'_{v_i}(w_j)$ and construct the fans from each of these selected received messages to the corresponding high-probability emitted messages, then, according to Corollary V, most of the emitted messages will be covered by fans. Let a fan be called by the received message from which it originates. The SQ is then to be constructed in such a way as to code an emitted message into the name of one of the fans that covers that emitted message. Evidently the SQ will involve storage facilities as well as reading and comparison circuits.

(24) Cf. I, 2 for this section.

In  order  to  build  the  CM  it  is  necessary  to  find  the  channel 
capacity. 


(4) 


25 


2  P(xi,y.) 

p“>  ftk  p<xi-yJ)logw^7 


i  with  Qx  (yj)  -  3/4  6 


x<y 


where, say, $x_1$ and $y_1$ represent the binary digit 0, and $x_2$ and $y_2$ represent the digit 1.

Let the maximum be achieved for, say, $P(x_i) = P''(x_i)$. If 1000R < 100C, i.e. if R < C/10, it is possible to code the sequences at w into sequences at x in such a way that, according to Theorem II, there will be no error in transmission. The transmitted messages must have a statistic $P''(x_i)$ and the required CM will again involve storage, reading, and comparison circuits.

(b) From the fact that the transmission is band-limited and subject to an average power limitation it follows that the speech should be coded into white noise. Taking 100 words per minute as a reasonable rate of speaking, the information rate of speech comes out to be about 10 units/second relative to a fidelity criterion that requires only intelligibility.$^{(25)}$

The  combined  FQ,  SQ,  and  CM  necessary  would  be  a  device  that 
stores  long  speech-sound  groups,  say  sentences,  and  looks  up  the  ap¬ 
propriate  white  noise  representation  in  a  code  book.  Building  such  a 
coder  is  a  purely  technical  problem  outside  the  scope  of  information 
theory. 

If the speech code is to be transmitted without error over a 10 cps. band then, according to III, 5, (13), the received signal-to-noise ratio S/N must be at least as great as the root of

(5)  $10 = 10\log(1+S/N)$, or

(6)  Required $S/N \ge 2$.
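The root of (5) can be computed directly. This sketch assumes that the logarithm in (5) is natural, in keeping with the exponentials used in section IV; the function name is illustrative.

```python
import math

def required_snr(rate, W):
    """Solve rate = W ln(1 + S/N) for S/N, with natural logarithms assumed."""
    return math.exp(rate / W) - 1.0

snr = required_snr(10.0, 10.0)     # 10 units/second over a 10 cps. band
```

Under that assumption the root is e - 1, roughly 1.7, which is of the same order as the rounded figure quoted in (6).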

(c) This problem can be formulated mathematically but the formulation is actually quite useless. If we assume the device to take photographs of the sky, and if only a finite number of photographs are possible (e.g. if different photographs differ only in that different squares of a rectangular grid are filled in) there will be only a finite number of "source symbols", and it is only necessary to build an appropriate SQ. Let the possible photographs be enumerated by i = 1,2,...,n, and the possible cloud types by j = 1,2,3. In order to build the SQ it is first necessary to calculate the information rate of the source subject to the fidelity criterion $\sum_{i,j} P(i,j)\rho(i,j)$ = const. Thus it is necessary to have an a-priori set of probabilities as to the types of photographs expected, and it is also necessary to know $\rho(i,j)$ in terms of i and j. The latter requirement simply means that it is necessary to know a decision method for determining whether an arbitrary fixed value of i corresponds to a cloud of the cirrus, stratus, or cumulus types before it is possible to go ahead with the calculations necessary to obtain the SQ. However, finding such a decision method is the entire essence of the posed problem.

(25) A result of experiments carried out to determine the redundancy of written English. See ref. 22.


VII 